Offline Handwritten Script Identification in Document Images
Abstract
Automatic identification of handwritten script in document images supports many important applications, such as sorting and transcription of multilingual documents and indexing of large collections of such images, and can serve as a precursor to optical character recognition (OCR). In this paper, we investigate texture as a tool for determining the script of a handwritten document image, based on the observation that text has a distinct visual texture. A k-nearest neighbour algorithm is used to classify 300 text blocks and 400 text lines into one of three major Indian scripts (English, Devnagari and Urdu), based on 13 spatial spread features extracted using morphological filters. With a five-fold cross-validation test, the proposed algorithm attains an average classification accuracy as high as 99.2% for bi-script separation at the text line level and 88.6% for tri-script separation at the text block level.
General Terms: Pattern Recognition, Document Image Analysis
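The classification stage described in the abstract (k-nearest neighbour over 13-dimensional feature vectors, scored by five-fold cross-validation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not state the value of k or the distance metric, so k = 3 and Euclidean distance are used, and the feature vectors are synthetic Gaussian samples standing in for the morphological spatial spread features, whose extraction is not shown.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training vectors.

    Euclidean distance and k=3 are assumptions, not values from the paper.
    """
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(X, y, folds=5, k=3, seed=0):
    """Return the mean accuracy over `folds` disjoint held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    accuracies = []
    for f in range(folds):
        test_idx = set(idx[f::folds])  # every folds-th shuffled sample
        train_X = [X[i] for i in idx if i not in test_idx]
        train_y = [y[i] for i in idx if i not in test_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / folds


def synthetic_features(n_per_class=100, n_features=13, seed=1):
    """Hypothetical stand-in data: one Gaussian cluster per script label."""
    rng = random.Random(seed)
    X, y = [], []
    for c, script in enumerate(["English", "Devnagari", "Urdu"]):
        for _ in range(n_per_class):
            X.append([rng.gauss(2.0 * c, 1.0) for _ in range(n_features)])
            y.append(script)
    return X, y


X, y = synthetic_features()
accuracy = cross_validate(X, y)
print(f"mean five-fold accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed by this sketch reflects only how well separated the synthetic clusters are and says nothing about the figures reported in the abstract.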
Related articles
Handwritten Script Identification from a Bi-Script Document at Line Level using Gabor Filters
In a country like India, where many scripts are in use, automatic identification of printed and handwritten script facilitates many important applications, including sorting of document images and searching online archives of document images. In this paper, a Gabor feature based approach is presented to identify different Indian scripts from handwritten document images. Eight popular In...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
To facilitate data entry into computers and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
Convolution Based Technique for Indic Script Identification from Handwritten Document Images
Determining the script type of a document image is a complex real-life problem for a multi-script country like India, where 23 official languages (including English) are present and 13 different scripts are used to write them. Including English and Roman, those counts become 23 and 13 respectively. The problem becomes more challenging when handwritten documents are considered. In this paper an app...
A new dataset of word-level offline handwritten numeral images from four official Indic scripts and its benchmarking using image transform fusion
Developing a handwritten document image dataset is one of the most tedious and time-consuming tasks in experimental work related to optical character recognisers (OCR). Special attention needs to be given to feasibility, realness, clarity, etc. while collecting real-life data from different writers. Few efforts can be found in the literature for development of handwritten NIdb (numeral image ...
An improved offline handwritten character segmentation algorithm for Bangla script
Effective segmentation of offline handwritten word images in unconstrained handwritten Bangla script is a challenging problem in Optical Character Recognition (OCR) applications. The presence of a continuous horizontal line called ‘Matra’ is an important feature of this script. However, in unconstrained cursive handwriting, the Matra can be wavy or discontinuous, which makes the problem of segmentation diffic...